Skip to content

Conversation

@fangchenli
Copy link
Contributor

@fangchenli fangchenli commented Jan 3, 2026

What changes were proposed in this pull request?

This PR implements unified type coercion for Arrow-backed Python UDF. It adds a CoercionPolicy option to the config and the coerce() methods to DataType classes to control type conversion behavior when Arrow optimization is enabled.

STRICT vs PERMISSIVE Mode Benchmark Results (100K values)

Type STRICT PERMISSIVE Overhead %
IntegerType 2.08ms 7.78ms 5.70ms 273.9%
LongType 1.92ms 7.60ms 5.68ms 295.8%
DoubleType 2.08ms 5.58ms 3.50ms 167.9%
StringType 3.60ms 14.85ms 11.25ms 312.2%
BooleanType 1.41ms 4.70ms 3.29ms 232.8%
DateType 3.23ms 30.88ms 27.65ms 855.6%
TimestampType 4.55ms 34.29ms 29.74ms 653.7%
DecimalType(18,2) 39.75ms 45.63ms 5.88ms 14.8%
BinaryType 1.61ms 5.71ms 4.10ms 254.3%
ArrayType(IntegerType) 7.41ms 56.00ms 48.59ms 656.0%
ArrayType(LongType) 7.41ms 13.23ms 5.82ms 78.6%
MapType(StringType,IntegerType) 19.50ms 103.80ms 84.30ms 432.4%
MapType(LongType,LongType) 12.29ms 45.09ms 32.80ms 266.8%
StructType(int,string) 8.62ms 46.80ms 38.18ms 443.0%
StructType(long,long) 8.82ms 13.92ms 5.10ms 57.8%
NestedStruct 12.04ms 61.35ms 49.31ms 409.7%
ArrayOfStruct 15.00ms 110.50ms 95.50ms 636.8%

Why are the changes needed?

When Arrow optimization is enabled for Python UDFs, the type coercion behavior differs from that of the pickle-based UDFs. We need to control the coercion behavior for backward compatibility and to help migrations.

Does this PR introduce any user-facing change?

Yes. A new configuration spark.sql.execution.pythonUDF.coercion.policy is added with three options:

  • PERMISSIVE: Matches pickle behavior - returns None for most type mismatches
  • WARN: Same as PERMISSIVE but logs warnings when Arrow would behave differently
  • STRICT: Arrow handles type conversion natively (original Arrow behavior)

With PERMISSIVE (default), users can enable Arrow optimization without behavior changes.

How was this patch tested?

Both unit tests and integration tests were added

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.5

@github-actions
Copy link

github-actions bot commented Jan 3, 2026

JIRA Issue Information

=== Sub-task SPARK-54891 ===
Summary: Unified type coercion for Arrow-backed Python UDFs
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@fangchenli fangchenli changed the title [WIP][SPARK-54891](https://issues.apache.org/jira/browse/SPARK-54891) Unified type coercion for Arrow-backed Python UDFs [WIP][SPARK-54891]Unified type coercion for Arrow-backed Python UDFs Jan 3, 2026
@fangchenli fangchenli changed the title [WIP][SPARK-54891]Unified type coercion for Arrow-backed Python UDFs [WIP][SPARK-54891] Unified type coercion for Arrow-backed Python UDFs Jan 3, 2026
@fangchenli fangchenli marked this pull request as ready for review January 4, 2026 01:59
@fangchenli fangchenli changed the title [WIP][SPARK-54891] Unified type coercion for Arrow-backed Python UDFs [SPARK-54891] Unified type coercion for Arrow-backed Python UDFs Jan 4, 2026
Copy link
Contributor

@zhengruifeng zhengruifeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fangchenli thanks for working on this.
There are 3 known different behaviors of python udf:
1, vanilla python udf, based on pickle;
2, arrow-optimized python udf (with legacy pandas conversion), with useArrow=True or spark.sql.execution.pythonUDF.arrow.enabled=True;
3, arrow-optimized python udf (without legacy pandas conversion), with 2 and spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled=False;

it seems this PR is for 2, I personally think it is a good idea if we can eliminate the behavior differences in both 1 vs 2 and 1 vs 3.

return wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf)


def wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code path is for arrow-optimized python udf without legcay pandas conversion, there is another path wrap_arrow_batch_udf_legacy for python udf with legcay pandas conversion

@gaogaotiantian
Copy link
Contributor

I have some detail suggestions on the code itself but I want to wait until we have a further discussion about this feature (just to avoid wasting our time).

The current implementation (or the concept as its own) will introduce a non-trivial overhead for data conversion. For simple and common types like integer, bytes, strings, we introduced multiple python-level function calls to each element, which could result in a perf regression - to default behavior.

The implementation is not complete either I think? We need to handle this in all container types I assume?

I'd suggest that we discuss whether we want to implement this - if the benefit trumps the overhead in both perf and maintenance, before we start reviewing the code itself.

@fangchenli
Copy link
Contributor Author

I'd suggest that we discuss whether we want to implement this - if the benefit trumps the overhead in both perf and maintenance, before we start reviewing the code itself.

Thanks for the feedback. I'll benchmark it to determine the overhead.

fangchenli and others added 16 commits January 14, 2026 19:26
This commit implements the STRICT policy as a no-op that lets Arrow
handle type conversion natively, while PERMISSIVE/WARN policies
implement pickle-compatible coercion behavior.

Changes:
- worker.py: Add conditional logic so STRICT skips coercion entirely
- types.py: Update all coerce() methods to return value unchanged for STRICT
- test_coercion.py: Update unit tests to verify STRICT no-op behavior
- test_arrow_udf_coercion.py: Add integration tests comparing policies

The integration tests verify:
- PERMISSIVE matches pickle behavior exactly
- WARN produces same results as PERMISSIVE
- STRICT produces different results (Arrow's aggressive conversion)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions github-actions bot added the BUILD label Jan 15, 2026
@fangchenli fangchenli force-pushed the unified-type-coercion branch from 33d8af8 to b30e1e6 Compare January 15, 2026 04:41
@fangchenli
Copy link
Contributor Author

Close for now since coercion adds significant overhead.

@fangchenli fangchenli closed this Jan 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants